 comprehensive assessment


DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Neural Information Processing Systems

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance, where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.


DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image Dataset

Pan, Liangrui, Li, Xingchen, Chen, Zhongyi, Chu, Ling, Peng, Shaoliang

arXiv.org Artificial Intelligence

Pathologists' comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning, are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.


A System for Comprehensive Assessment of RAG Frameworks

Rengo, Mattia, Beadini, Senad, Alfano, Domenico, Abbruzzese, Roberto

arXiv.org Artificial Intelligence

Retrieval Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. To address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling a limited-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Using the REST API interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available on GitHub.
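The black-box comparison described in this abstract can be illustrated with a minimal sketch. This is not SCARF's code or API: the names `RagSystem`, `contains_answer`, and `compare_systems` are hypothetical, and the substring-match score is only a stand-in for whatever metrics the framework actually reports. Each system is treated as an opaque function from question to answer; in a real deployment that function would presumably wrap a call to the system's REST endpoint.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a deployed RAG system seen only through its
# question -> answer behaviour (black box).
RagSystem = Callable[[str], str]


def contains_answer(response: str, expected: str) -> bool:
    """Stand-in metric (an assumption): case-insensitive substring match."""
    return expected.strip().lower() in response.lower()


def compare_systems(
    systems: Dict[str, RagSystem],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Run the same (question, expected answer) cases through every system
    and return the fraction of cases each system answers correctly."""
    report: Dict[str, float] = {}
    for name, ask in systems.items():
        hits = sum(contains_answer(ask(q), expected) for q, expected in cases)
        report[name] = hits / len(cases) if cases else 0.0
    return report


# Toy stand-ins for two deployed systems (illustrative only).
def system_a(question: str) -> str:
    if "France" in question:
        return "Paris is the capital of France."
    return "I do not know."


def system_b(question: str) -> str:
    return "I do not know."


cases = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Italy?", "Rome"),
]
report = compare_systems({"a": system_a, "b": system_b}, cases)
print(report)
```

Because the harness only sees the question and the returned answer, the retriever, vector database, or LLM behind a system can be swapped without touching the evaluation loop.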


Towards a vision foundation model for comprehensive assessment of Cardiac MRI

Jacob, Athira J, Borgohain, Indraneel, Chitiboi, Teodora, Sharma, Puneet, Comaniciu, Dorin, Rueckert, Daniel

arXiv.org Artificial Intelligence

Cardiac magnetic resonance imaging (CMR), considered the gold standard for noninvasive cardiac assessment, is a diverse and complex modality requiring a wide variety of image processing tasks for comprehensive assessment of cardiac morphology and function. Advances in deep learning have enabled the development of state-of-the-art (SoTA) models for these tasks. However, model training is challenging due to data and label scarcity, especially in the less common imaging sequences. Moreover, each model is often trained for a specific task, with no connection between related tasks. In this work, we introduce a vision foundation model for CMR assessment that is trained in a self-supervised fashion on 36 million CMR images. We then finetune the model in a supervised way for 9 clinical tasks typical of a CMR workflow, across classification, segmentation, landmark localization, and pathology detection. We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes. We also demonstrate improved few-shot learning with fewer labeled samples, a common challenge in medical image analyses. We achieve an out-of-box performance comparable to SoTA for most clinical tasks. The proposed method thus presents a resource-efficient, unified framework for CMR assessment, with the potential to accelerate the development of deep learning-based solutions for image analysis tasks, even with few annotated data available.


Interview with Bo Li: A comprehensive assessment of trustworthiness in GPT models

AIHub

Bo Li and colleagues won an outstanding datasets and benchmark track award at NeurIPS 2023 for their work DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In this interview, Bo tells us about the research, the team's methodology, and key findings. We focus on assessing the safety and risks of foundation models. In particular, we provide the first comprehensive trustworthiness evaluation platform for large language models (LLMs). Given the wide adoption of LLMs, it is critical to understand their safety and risks in different scenarios before large deployments in the real world.


DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Wang, Boxin, Chen, Weixin, Pei, Hengzhi, Xie, Chulin, Kang, Mintong, Zhang, Chenhui, Xu, Chejian, Xiong, Zidi, Dutta, Ritik, Schaeffer, Rylan, Truong, Sang T., Arora, Simran, Mazeika, Mantas, Hendrycks, Dan, Lin, Zinan, Cheng, Yu, Koyejo, Sanmi, Song, Dawn, Li, Bo

arXiv.org Artificial Intelligence

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2.


Evaluating Large Language Models: A Comprehensive Survey

Guo, Zishan, Jin, Renren, Liu, Chuang, Huang, Yufei, Shi, Dan, Supryadi, Yu, Linhao, Liu, Yan, Li, Jiaxuan, Xiong, Bojian, Xiong, Deyi

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to the comprehensive review of the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.


Deep learning for next-generation sleep diagnostics

#artificialintelligence

Currently, the diagnosis of sleep disorders relies on polysomnographic recordings and a time-consuming manual analysis with low agreement between different manual scorers. Throughout the night, sleep stages are identified manually in non-overlapping 30-second epochs, starting from the onset of the recording, based on electroencephalography (EEG), electro-oculography (EOG), and chin electromyography (EMG) signals, which require meticulous placement of electrodes. Moreover, the diagnosis of many sleep disorders relies on outdated guidelines. When assessing the severity of obstructive sleep apnea (OSA), patients are classified based on thresholds of the apnea-hypopnea index (AHI), i.e., the number of respiratory disruptions per hour of sleep. These thresholds are not fully based on solid scientific evidence and remain the same across different measurement techniques.
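The epoch segmentation and threshold-based classification mentioned here can be sketched in a few lines. This is an illustration, not code from the article: the function names are hypothetical, and the cutoffs of 5, 15, and 30 events per hour are the commonly used adult values, supplied as an assumption because the text does not state them. The AHI is computed in its conventional form, as respiratory events per hour of sleep.

```python
from typing import List, Tuple


def epoch_bounds(
    recording_seconds: float, epoch_seconds: float = 30.0
) -> List[Tuple[float, float]]:
    """Start and end times of non-overlapping epochs counted from recording
    onset. A trailing partial epoch is dropped (an assumption)."""
    n_epochs = int(recording_seconds // epoch_seconds)
    return [(i * epoch_seconds, (i + 1) * epoch_seconds) for i in range(n_epochs)]


def apnea_hypopnea_index(num_events: int, sleep_seconds: float) -> float:
    """Respiratory events (apneas plus hypopneas) per hour of sleep."""
    if sleep_seconds <= 0:
        raise ValueError("sleep_seconds must be positive")
    return num_events / (sleep_seconds / 3600.0)


def osa_severity(
    ahi: float, thresholds: Tuple[float, float, float] = (5.0, 15.0, 30.0)
) -> str:
    """Map an AHI value to a severity class using fixed cutoffs.
    The default cutoffs are assumed, not taken from the article."""
    mild, moderate, severe = thresholds
    if ahi < mild:
        return "none"
    if ahi < moderate:
        return "mild"
    if ahi < severe:
        return "moderate"
    return "severe"


# Example: an 8-hour recording with 120 scored respiratory events.
epochs = epoch_bounds(8 * 3600)
ahi = apnea_hypopnea_index(120, 8 * 3600)
print(len(epochs), ahi, osa_severity(ahi))
```

The hard cutoffs make the article's concern concrete: a patient at 14.9 events per hour and one at 15.0 land in different classes, regardless of how the events were measured.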


[24]7.ai Earns Top Score in Opus Research's Decision Makers' Guide to Enterprise Intelligent Assistants Report 2019 Edition Markets Insider

#artificialintelligence

The 2019 edition of Opus Research's Decision Makers' Guide to Enterprise Intelligent Assistants report determined [24]7 AIVA to be a top solution for enterprises, and the only virtual agent solution capable of delivering across a breadth of simple FAQs to complex, conversational issues to online transactions. The Opus report presents a comprehensive assessment of 16 enterprise-grade Intelligent Assistant solution providers, with a focus on natural language processing, machine learning, AI, analytics and customer management integration to power digital self-service solutions. The report highlights [24]7 AIVA's ability to support both voice and digital channels and deliver unified self-service, calling out the company's differentiators as being a unique blend of AI and human insights, two decades of unparalleled experience in customer journeys across all channels, and proprietary insights including more than 150 patents and patent applications. "We analyzed a short-list of the leading providers in natural language processing, machine learning, AI and analytics to develop the industry's most comprehensive assessment of today's virtual agents and digital self-service solutions," said Dan Miller, lead analyst, Opus Research. An agent can take over a bot conversation at any time, and hand the conversation back to the bot to complete the interactions.